
Genome Biology

Springer Science and Business Media LLC

Preprints posted in the last 7 days, ranked by how well they match Genome Biology's content profile, based on 555 papers previously published here. The average preprint has a 0.30% match score for this journal, so anything above that is already an above-average fit.

1
DNAharvester: A Nextflow Pipeline for Analysing Highly Degraded DNA from Ancient and Historical Specimens

Sharif, B.; Kutschera, V. E.; Oskolkov, N.; Guinet, B.; Lord, E.; Chacon-Duque, J. C.; Oppenheimer, J.; van der Valk, T.; Diez-del-Molino, D.; D. Heintzman, P.; Dalen, L.

2026-04-21 bioinformatics 10.64898/2026.04.20.719564 medRxiv
Top 0.1%
14.4%

Ancient DNA (aDNA) research has advanced rapidly with the development of high-throughput sequencing, which now enables genome-wide analyses of large collections of prehistoric specimens. However, analysing palaeontological and archaeological material with highly degraded DNA constitutes a major bioinformatic challenge. DNA from such samples is characterised by short fragment lengths, low endogenous content, post-mortem damage, and considerable cross-species contamination, which can increase spurious mapping and reference bias, affecting downstream population genetic inferences. Here we present DNAharvester, a modular and reproducible pipeline designed specifically for the processing of highly degraded DNA from ancient and historical specimens. DNAharvester integrates metagenomic filtering before mapping, competitive mapping, adaptive aligner selection (incorporating algorithms such as BWA-aln, BWA-mem, and Bowtie2), and systematic evaluation of reference bias and spurious mapping. By incorporating flexible mapping and filtering strategies, the pipeline can be adapted to varying sample preservation, with a distinct focus on maximising authentic data recovery from highly degraded material. Furthermore, DNAharvester features comprehensive subworkflows for iterative assembly of mitogenomes, identification of genomic repeats and CpG sites, taxonomic classification, microbial/pathogen screening of unmapped reads, genetic sex determination, and variant calling for downstream analyses. To accommodate datasets with varying sequencing depths, the pipeline incorporates multiple variant calling strategies, including diploid variant calling, genotype likelihood estimation, and pseudo-haploid random allele calling. Implemented in Nextflow, DNAharvester provides a highly scalable, containerised framework that enhances reproducibility, portability, and robustness in aDNA analyses. 
We validated the pipeline across a gradient of simulated scenarios and empirical datasets, demonstrating its ability to systematically mitigate complex background contamination while preserving authentic genomic signals even in the most challenging of circumstances. By streamlining complex bioinformatic tasks through simple configuration files, DNAharvester establishes a standardised approach for the rigorous analysis of highly degraded DNA datasets and makes genomic analyses of ancient remains accessible to the broader research community.
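Of the variant calling strategies named above, pseudo-haploid random allele calling is the simplest to illustrate. The sketch below is not DNAharvester's code: the `pileup` structure and the function name are hypothetical, and a real pipeline would also filter on base quality, mapping quality, and damage patterns before sampling.

```python
import random

def pseudo_haploid_call(pileup, seed=0):
    """Pick one observed base uniformly at random at each covered site.

    `pileup` maps a (chromosome, position) tuple to the list of bases seen
    in reads overlapping that site. Both the structure and the name are
    illustrative assumptions, not part of any published tool.
    """
    rng = random.Random(seed)
    calls = {}
    for site, bases in pileup.items():
        if bases:  # sites without coverage yield no call
            calls[site] = rng.choice(bases)
    return calls

# Toy pileup: one heterozygous-looking site, one single-read site, one empty site.
pileup = {
    ("chr1", 101): ["A", "A", "G"],
    ("chr1", 250): ["T"],
    ("chr1", 300): [],
}
calls = pseudo_haploid_call(pileup)
```

Sampling a single read per site gives every sample the same effective ploidy regardless of depth, which is why this strategy is commonly used for low-coverage data.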

2
CHORD: a framework for cross-species single-cell integration across gene, cell and cell-type levels

Lin, Y.; Zhu, X.; Zhou, X.; Zhang, X.; Cai, G.; Zhao, W.; Zhou, J.; Liu, J.; Zhu, Q.; Zhang, M.; Zhou, B.; Gu, X.; Zhou, Z.

2026-04-22 bioinformatics 10.64898/2026.04.19.719426 medRxiv
Top 0.1%
14.2%

Quantifying cross-species relationships among cell types from single-cell transcriptomic data can reveal both conserved and divergent patterns of cell-type hierarchies. However, existing cross-species integration methods can be limited in modeling genes beyond orthologs by leveraging cell-type-resolved transcriptional context, or in learning explicit type-level representations. Here we present CHORD, a cross-species integration framework that jointly learns representations of genes, cells and cell types. We demonstrate that CHORD can integrate cross-species single-cell atlases and support cell-type annotation with unknown cell-type detection. In the frog-zebrafish embryogenesis and mammalian motor cortex atlases, CHORD infers cell-type trees that place conserved cell types from different species in relative proximity and summarize hierarchical relationships among cell types. CHORD also supports cross-species comparison of continuous phenotypic variation by placing embryonic cells along an aligned developmental timeline. CHORD further yields gene embeddings that capture orthologous and functional relationships, and gene importance scores linking genes to cell types.

3
scSketch: Interactive Sketch-based Trajectory Exploration and Pathway-Aware Analysis of Single-Cell Data

Temirbek, A.; Lekschas, F.; Sankaran, K.; Colubri, A.

2026-04-21 bioinformatics 10.64898/2026.04.16.718997 medRxiv
Top 0.2%
12.6%

Interactively exploring gene expression gradients across low-dimensional cell embeddings is central to single-cell RNA sequencing analysis, yet no existing tools allow users to sketch trajectories and interactively compute pathway-level interpretations. We present scSketch, a tool that enables users to iteratively explore and test trajectory hypotheses in single-cell data while maintaining statistical validity and biological interpretability. Specifically, users apply interactive directional sketching to draw trajectories across embeddings and probe continuous processes such as cellular differentiation and cell state transitions. scSketch automatically computes gene-trajectory correlations and applies online false discovery rate (FDR) control to maintain statistical validity during iterative exploration. Significant genes are grouped into Reactome pathways for contextual interpretation. Applied to human oral keratinocytes infected with human cytomegalovirus, scSketch revealed infection-associated gradients involving interferon responses, metabolic remodeling and autophagy. Together, these features position scSketch as a bridge between exploratory visualization and mechanistic insight in single-cell biology. Pseudocode and full algorithm details for online FDR and interactive directional sketching are available in Supplementary Methods S1 and S2.
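The abstract does not say which online FDR procedure scSketch uses. As a rough illustration of the idea, the sketch below implements LOND, one known online procedure in which each test's threshold grows with the number of discoveries made so far. The choice of LOND, the weight sequence, and the function name are assumptions made for this example only.

```python
import math

def lond(p_values, alpha=0.05):
    """Online FDR control with the LOND rule (illustrative assumption).

    Test i is rejected when p_i <= alpha * gamma_i * (D + 1), where D is the
    number of rejections made before test i and gamma_i = 6 / (pi^2 * i^2)
    is a non-negative sequence that sums to 1.
    """
    decisions = []
    discoveries = 0
    for i, p in enumerate(p_values, start=1):
        gamma_i = 6.0 / (math.pi ** 2 * i ** 2)
        level = alpha * gamma_i * (discoveries + 1)
        reject = p <= level
        decisions.append(reject)
        discoveries += int(reject)
    return decisions

# P-values arrive one at a time, e.g. one per sketched trajectory test.
decisions = lond([0.001, 0.5, 0.004, 0.2])
```

Because each decision depends only on earlier p-values, thresholds never need to be revised when a user sketches another trajectory.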

4
CRISP enables comparisons of image-based spatial transcriptomic segmentation quality across ten organs

Rose, J. R.; Rose, E. S.; Assumpcao, J. A. F.; Pathak, H.; Peck, H. E.; Sasser, L. E.; Patel, C. J.; Vanover, D.; Santangelo, P. J.

2026-04-21 bioinformatics 10.64898/2026.04.16.718947 medRxiv
Top 0.4%
10.1%

Image-based spatial transcriptomics depends on cell segmentation to assign transcripts to individual cells, but how segmentation algorithms perform across tissues with distinct cellular architectures is poorly understood. This study presents the broadest independent benchmark to date of cell segmentation algorithms for spatial transcriptomics, comparing five approaches across ten mouse tissues using a 5,006-gene Xenium panel. To quantify segmentation errors, Co-expression Rejection in Segmentation Purity (CRISP) was developed, an open-source tool available in R and Python that measures cell purity through tissue-specific mutually exclusive marker co-expression without requiring ground truth annotations. This benchmark revealed that segmentation algorithms face a fundamental tradeoff between maximizing transcript capture and maintaining cell purity, and that the severity of this tradeoff is tissue-dependent. Proseg achieved the highest average performance across tissues, though the magnitude of its advantage varies with tissue architecture. Overall, CRISP provides per-tissue performance profiles as a practical resource for algorithm selection.

5
PathPinpointR: Predicting the progression of sc-RNAseq samples through reference trajectories.

Nicholas, M. T.; Mehta, D.; Ouyang, J.; Dawoud, A.; Ellison, C.; Westendorf, J.; Green, L. A.; Skipp, P.; Rackham, O.

2026-04-21 bioinformatics 10.64898/2026.04.21.715327 medRxiv
Top 0.4%
10.0%

Single-cell RNA sequencing (scRNA-seq) has transformed our ability to analyse cellular heterogeneity, enabling detailed mapping of cellular progression. Trajectory inference tools construct trajectories from scRNA-seq data, facilitating the tracing of cellular progression through developmental pathways. PathPinpointR (PPR) is a lightweight and user-friendly R package developed to predict and compare the positions of scRNA-seq samples along reference biological trajectories, such as those created from large cell atlas projects. PPR utilises sets of switching-gene events from reference trajectories as indicators of cellular progression. By applying these positional indicators to query datasets, each cell can be accurately assigned a pseudo-time value, providing predictive insight into its position along a trajectory. This information can be used to stage cells within an established developmental process, or to evaluate how different patient samples compare when mapped onto reference disease or drug response trajectories. Availability: PathPinpointR is available at https://github.com/moi-taiga/PathPinpointR. Contact: o.j.l.rackham@soton.ac.uk

6
Scalable, Generalizable, and Uncertainty-Aware Integration of Spatial Multi-Omics Across Diverse Modalities and Platforms with SCIGMA

Chang, S.; Fleischmann, A.; Ma, Y.

2026-04-22 bioinformatics 10.64898/2026.04.19.718223 medRxiv
Top 0.6%
8.4%

Recent advances in spatial omics technologies have enabled simultaneous profiling of transcriptomic, proteomic, epigenomic, metabolomic, and imaging data at high spatial resolution, offering unprecedented opportunities to dissect tissue complexity. However, integrating these diverse and large-scale spatial multi-modal datasets remains a major computational challenge. We present SCIGMA, a scalable and generalizable deep learning framework for spatial multi-omics integration. SCIGMA introduces a novel uncertainty-aware contrastive learning objective and multi-view graph neural networks to preserve modality-specific signals while learning biologically meaningful joint representations. Unlike existing methods, SCIGMA provides spatially resolved uncertainty estimates, improving interpretability and identifying regions of biological or technical heterogeneity. SCIGMA is the first spatial multi-omics method to support integration of up to five modalities - as demonstrated on Spatial-Mux-Seq data - and its modular framework is extensible to future technologies with even more modalities. It also scales to over one million spatial locations, enabling analysis of ultra-high-resolution datasets such as VisiumHD and Xenium Prime. We evaluated SCIGMA across 19 datasets spanning 8 modalities, 10 tissues, and 9 platforms. On benchmarkable datasets, SCIGMA outperformed existing methods in spatial domain detection, modality preservation, feature reconstruction, and reproducibility. Across applications, it uncovered biologically meaningful structures, refined spatial domains, and modality-specific regulatory programs, while its uncertainty estimates revealed tissue regions with potential biological or technical variation. Together, SCIGMA provides a robust, flexible, and future-ready solution for scalable spatial multi-modal integration.

7
Novel Parameter-Free and Interpretable Integration of CITE-seq RNA and ADT Profiles via Tensor Decomposition-Based Unsupervised Feature Extraction

Taguchi, Y.-h.; Turki, T.

2026-04-21 bioinformatics 10.64898/2026.04.18.719420 medRxiv
Top 0.7%
8.2%

CITE-seq jointly profiles cellular transcripts and surface proteins, but integrating RNA and antibody-derived tags (ADTs) remains challenging because the two modalities differ markedly in dimensionality, sparsity, and noise characteristics. We present a tensor decomposition-based unsupervised feature extraction framework for the parameter-free integration of CITE-seq data. By constructing a gene x cell x protein tensor and applying HOSVD, this method derives the shared latent representations of genes, cells, and proteins without prior gene filtering or modality-weight tuning. Across five ImmGen T-cell CITE-seq datasets, the resulting cell embeddings were generally more consistent with annotated cell types than RNA-only, protein-only, or totalVI-based embeddings, whereas the organ-level consistency did not improve. The latent factors also enabled post hoc unsupervised gene selection, and the selected genes showed biologically meaningful enrichment for T-cell-related terms. In addition, failure in a poor-quality dataset served as a useful quality-control signal. Together with a blocked sparse-matrix implementation for large tensors, these results indicate that tensor decomposition-based unsupervised feature extraction provides an interpretable, scalable, and competitive approach for integrating RNA and ADT measurements in CITE-seq experiments.
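A minimal NumPy sketch of HOSVD on a gene x cell x protein tensor may make the decomposition concrete. It assumes the tensor is formed as the per-cell product of RNA and ADT values, which the abstract does not state. The toy matrices and function names are illustrative, and the blocked sparse-matrix implementation mentioned above is not reproduced.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-n unfolding: move `mode` to the front and flatten the rest."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def hosvd(tensor):
    """Higher-order SVD of a 3-way tensor.

    Returns the core tensor and one orthonormal factor matrix per mode,
    each taken from the left singular vectors of the mode-n unfolding.
    """
    factors = []
    for mode in range(tensor.ndim):
        u, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        factors.append(u)
    # Project the tensor onto the factor bases to obtain the core tensor.
    core = np.einsum("ijk,ia,jb,kc->abc", tensor, *factors)
    return core, factors

rng = np.random.default_rng(0)
rna = rng.poisson(2.0, size=(6, 5)).astype(float)  # genes x cells (toy data)
adt = rng.poisson(5.0, size=(3, 5)).astype(float)  # proteins x cells (toy data)

# Assumed construction: x[g, c, p] = rna[g, c] * adt[p, c].
tensor = np.einsum("gc,pc->gcp", rna, adt)

core, (u_gene, u_cell, u_protein) = hosvd(tensor)
reconstructed = np.einsum("abc,ia,jb,kc->ijk", core, u_gene, u_cell, u_protein)
```

Here the columns of `u_cell` play the role of cell embeddings, and `u_gene` and `u_protein` give the matching gene and protein representations. No tuning parameter appears in the decomposition itself.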

8
Pan1c: a pipeline to easily build chromosome-level pangenome graphs

Mergez, A.; Racoupeau, M.; Bardou, P.; Linard, B.; Legeai, F.; Choulet, F.; Gaspin, C.; Klopp, C.

2026-04-21 bioinformatics 10.64898/2026.04.17.719212 medRxiv
Top 0.7%
7.9%

Advances in sequencing technologies and the availability of high-quality genome assemblies for many genotypes per species make it possible to improve sequence alignment rate and quality, as well as variant calling accuracy, by including all genomic variation in a graph reference called a pangenome graph. Because building and analysing a pangenome graph is still complex, and the related software packages are still under development, there is a pressing need for user-friendly pipelines in this emerging research area. Pan1C is a pipeline based on a chromosome-by-chromosome graph construction strategy. It integrates two complementary strategies for building pangenomes and produces informative metric plots and graphics using a large set of tools. By benchmarking Pan1C on human, fungal, and wheat assemblies, which span a wide range of genome sizes and complexities, we showed the value of Pan1C for assembly and graph validation as well as for performing primary analyses.

9
REPLAY: A reproducible and user-friendly application for DNA replication timing analysis from Repli-seq data

Dickinson, Q.; Yu, C.; Rivera-Mulia, J. C.

2026-04-21 genomics 10.64898/2026.04.16.719037 medRxiv
Top 0.8%
7.0%

Background: DNA replication timing (RT) is a fundamental feature of genome organization that is regulated in a cell-type-specific manner and frequently altered in disease. Repli-seq is the standard approach for genome-wide RT profiling; however, its analysis typically requires multiple independent tools and custom scripts, limiting reproducibility, portability, and accessibility, particularly for users without computational expertise. In addition, existing workflows often lack standardization and require substantial user intervention. Results: We developed REPLAY, a fully automated, reproducible, and user-friendly application for replication timing analysis. REPLAY is distributed as a standalone executable that enables end-to-end processing from compressed FASTQ files to genome-wide RT profiles without requiring software installation or programming experience. Through an intuitive graphical interface, users can configure analysis parameters, including input and output directories, reference genome, normalization strategy (quantile, median, or interquartile range), and smoothing. The application integrates all processing steps (quality control, trimming, alignment, binning, RT log2 calculation, normalization, smoothing, and visualization) within a single automated workflow. Application of REPLAY to publicly available datasets demonstrates accurate reconstruction of RT profiles and high reproducibility across samples. Conclusions: REPLAY offers a portable, reproducible, and accessible solution for the analysis of RT data. By eliminating the need for command-line tools and complex installations, it lowers the entry barrier, enabling standardized analysis across diverse research settings.
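The "RT log2 calculation" and two of the listed normalization options can be sketched as follows. This is not REPLAY's implementation: it assumes the common early/late Repli-seq convention of a per-bin log2(early/late) ratio, and the pseudocount, function names, and exact normalization formulas are illustrative choices.

```python
import numpy as np

def rt_log2(early, late, pseudocount=1.0):
    """Per-bin log2(early / late) ratio; the pseudocount avoids division by zero."""
    early = np.asarray(early, dtype=float)
    late = np.asarray(late, dtype=float)
    return np.log2((early + pseudocount) / (late + pseudocount))

def median_center(rt):
    """Median normalization, taken here to mean shifting the genome-wide median to 0."""
    return rt - np.median(rt)

def iqr_scale(rt):
    """Interquartile-range normalization: center on the median, divide by the IQR."""
    q1, q3 = np.percentile(rt, [25, 75])
    return (rt - np.median(rt)) / (q3 - q1)

# Toy read counts in five genomic bins for the early and late S-phase fractions.
early = np.array([7, 3, 1, 15, 0])
late = np.array([1, 3, 7, 0, 15])

rt = rt_log2(early, late)
rt_centered = median_center(rt)
rt_scaled = iqr_scale(rt)
```

Positive values mark bins enriched in the early fraction and negative values mark late-replicating bins. Scaling by the IQR puts samples with different dynamic ranges on a comparable footing.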

10
Robust causal gene network estimation for large-scale single-cell perturbation screens using reduced control function

Ge, C.; Li, H.

2026-04-21 bioinformatics 10.64898/2026.04.20.719759 medRxiv
Top 1%
6.4%

Single-cell CRISPR perturbation screens offer a powerful framework for causal discovery in gene regulatory networks, but existing methods struggle with high-dimensional count data, unmeasured confounding, and the increasing prevalence of high-multiplicity-of-infection (MOI) designs. We introduce RICE, a scalable framework for causal gene network estimation that integrates a reduced control function to address latent confounding with a constrained generalized linear model accommodating both hard and soft interventions. By enforcing differentiable acyclicity constraints, RICE enables efficient GPU-based optimization for large-scale data. Across synthetic benchmarks, RICE achieves higher accuracy and robustness than existing methods and remains stable under strong confounding and high-MOI settings. Applied to multiple single-cell perturbation datasets, including CRISPRi screens in K562 and RPE1 cells and a Perturb-CITE-seq data set with CRISPR-Cas9 knockout (KO), RICE recovers biologically coherent networks with edge weights consistent with perturbation effects and enriched for known regulatory interactions. These results establish RICE as a flexible and scalable approach for causal discovery in modern single-cell perturbation studies.

11
Functionally informed cis and trans proteome-wide association studies prioritize disease-critical genes

Hou, K.; Pazokitoroudi, A.; Strober, B.; Jiang, X.; Price, A. L.

2026-04-27 genetic and genomic medicine 10.64898/2026.04.24.26351667 medRxiv
Top 1%
6.2%

Proteome-wide association studies (PWAS) typically link genetically predicted protein levels to disease using cis-pQTLs, which can be limited by low cis-heritability for disease-critical genes under negative selection and by tagging due to co-regulation among nearby genes. Trans-pQTLs provide complementary information when large sample sizes are available to detect weak polygenic effects, enabling associations between trans-predicted protein levels and disease. We developed PolyPWAS, a functionally informed, summary statistics-based framework for associating both cis- and trans-predicted protein levels to disease. PolyPWAS integrates 96 functional annotations with proteome-wide pleiotropy to improve protein prediction, while correcting for PCs of predicted protein levels to limit tagging effects. We applied PolyPWAS to 2.8K plasma proteins measured in 34K UKB-PPP participants, analyzing GWAS summary statistics for 88 diseases and complex traits (average N=336K). Trans-predicted protein levels explained 21% of disease heritability (vs. 9.6% for cis-predicted protein levels), leveraging a 24% relative improvement in trans-prediction accuracy from functional priors. Trans-PWAS identified more significant protein-disease associations (and more conditionally significant associations) than cis-PWAS. Cis and trans associations showed only modest excess overlap (1.18, 95% CI: 1.11-1.26). Accordingly, combining evidence from cis and trans associations improved disease gene prioritization evaluated using gene sets from rare variant association studies (+11% relative improvement) and PoPS (+7.0% relative improvement) relative to cis-only approaches. PWAS associations to disease replicated across protein level cohorts, with strong UKB-PPP/deCODE concordance after adjusting for cohort-specific prediction accuracy. 
We provide examples where trans-regulatory effects link multiple disease-critical genes, underscoring the importance of integrating cis- and trans-regulatory effects to map protein-mediated disease biology.

12
A long-read RNA sequencing and polysome profiling framework reveals transposable element-driven transcript diversity and translational rewiring in glioblastoma

Pizzagalli, M.; Sasipalli, S.; Leary, O.; Tran, L.; Haas, B.; Tapinos, N.

2026-04-21 cancer biology 10.64898/2026.04.18.719388 medRxiv
Top 1%
6.2%

Background: Transposable elements (TEs) account for over half of the human genome and are often derepressed in cancer. TEs can add cryptic splice sites, undergo exonization, and generate gene-TE fusion transcripts, but the combined effects of TEs on RNA processing and translation in glioblastoma stem cells (GSCs) remain incompletely elucidated. Results: We combined long-read RNA sequencing with polysome profiling in four patient-derived GSCs and two neural stem cell (NSC) controls to resolve TE-associated transcript diversity and its relationship to ribosomal engagement. Across GSCs, we identified 13,421 alternative splicing (AS) events, 3,077 of which contained TEs within 150 bp of splice junctions. AS sites proximal to TEs were associated with increased isoform switching compared to non-TE-associated AS sites (odds ratio 2.9-4.3). Moreover, AS isoforms generated from TE-proximal sites were more likely to exhibit altered ribosomal association (odds ratio 2.54). Directional shifts were observed, with shorter isoforms associating with monosome fractions and longer isoforms with polysome fractions. To enable systematic detection of gene-TE chimeric transcripts, we developed FuTER (Fusion TE Reporter), a long-read-based framework for identifying TE-associated fusions. Application to GSC datasets identified 78 GSC-enriched fusion transcripts, several supported by breakpoint-spanning reads in polysome fractions, consistent with ribosome association. Conclusions: Our data suggest that TEs correlate with abnormal splicing activity and altered ribosome engagement in glioblastoma stem cells. By integrating long-read sequencing with polysome profiling and fusion detection, we establish a framework for analysis of TE-induced transcript diversity and its effects on cancer evolution and plasticity.

13
SpaceBender: Denoising Spatial Transcriptomics Data to Enhance Biological Signals

Chen, D. G.; Ribas, A.; Campbell, K. M.

2026-04-23 bioinformatics 10.64898/2026.04.20.719715 medRxiv
Top 1%
6.1%

Spatial transcriptomics (ST) allows for the simultaneous profiling of cell phenotype (e.g. transcriptome) and physical position. Although ST data has brought about numerous new biological insights, it remains limited by noise, largely in the form of RNA diffusion. Here, we introduce SpaceBender which leverages spatial-specific information (e.g. spatial ambient RNA niches) to build upon single-cell denoising strategies. SpaceBender outperforms current ST denoising methods in simulations and in vivo chimeric tissues. Through case studies, we demonstrate how SpaceBender unveils hidden biological insights and increases the significance of said insights as evaluated by statistical testing. Finally, we reveal how SpaceBender may also be applied to subcellular resolution data where it removes off-target expression of neighboring cell type specific marker genes. In all, we present SpaceBender as an ST denoising method, freely available as an open-source package, that may enhance the insights the field may draw from various ST data types.

14
MutaPhy: A clade-based framework to detect genotype-phenotype associations on phylogenetic trees

Ngo, A.; Guindon, S.; Pedergnana, V.

2026-04-21 evolutionary biology 10.64898/2026.04.19.719535 medRxiv
Top 1%
4.9%

Understanding how genetic variation in pathogens influences clinical phenotypes observed in infected hosts is a fundamental challenge in evolutionary genomics and public health. Phenotypic traits such as infection severity are often non-randomly distributed within the pathogen's phylogeny, suggesting the existence of evolutionary determinants but also violating the independence assumption underlying classical genome-wide association studies and potentially leading to inflated false positive rates. We present MutaPhy, a phylogeny-based method aimed at detecting correlations between a binary host phenotype and the corresponding pathogen genome by directly utilizing the hierarchical structure of phylogenetic trees. MutaPhy encompasses three different scales: (i) a subtree scale, on which relevant clades over-representing the phenotype of interest are detected using permutation-based tests; (ii) a tree scale, which agglomerates local signals into a global association statistic; and (iii) a site scale, whereby candidate mutational events on branches leading to significant clades are examined using ancestral sequence reconstruction. We evaluate the statistical behavior and detection performance of MutaPhy using simulations under diverse evolutionary scenarios. We also compare this tool to several existing phylogenetic association methods. As illustrative applications, we apply MutaPhy to dengue virus and hepatitis C virus datasets associated with clinical phenotypes in human hosts. Our results highlight the ability of the proposed approach to detect viral lineages associated with over-represented phenotypes while revealing limited evidence for robust mutation-level associations in these particular datasets. Altogether, MutaPhy provides a framework for guiding genotype-phenotype association analyses by leveraging phylogenetic structure, thereby reducing false positive findings and improving the interpretability of association signals.

15
Benchmarking Agentic Large Language Models for Complex Protein-Set Functional Annotation

Zhang, X.

2026-04-21 bioinformatics 10.64898/2026.04.18.719404 medRxiv
Top 1%
4.8%

Large language model (LLM) agents are increasingly used to synthesize heterogeneous bioinformatics evidence, but their reliability for high-volume biological annotation remains poorly characterized. We evaluated three agent configurations on a controlled protein annotation task: Claude App with Claude Opus 4.7, Claude Code CLI with Claude Opus 4.7 and Claude Scientific Skills, and Codex App with GPT-5.4 and Claude Scientific Skills. Each configuration was run three times on the same verbatim prompt, the same 73 selected orthogroup FASTA files (1,705 protein sequences), and the same local evidence: Swiss-Prot BLASTP output, Pfam/HMMER domain hits, DeepTMHMM topology predictions, and SignalP secretion predictions. We audited the nine outputs for coverage, biological correctness, missing evidence, hallucinated or over-specific annotations, and within-method consistency, then merged the best-supported evidence into a final orthogroup annotation table. All nine runs covered all 73 orthogroups, indicating that the agents could retrieve and organize the complete input set. However, normalized calcification-relevance calls were only moderately reproducible: within-method exact tier agreement ranged from 0.397 to 0.685 for Claude App (mean 0.562), 0.342 to 0.740 for Claude Code (mean 0.516), and 0.411 to 0.630 for Codex App (mean 0.539), and the per-run number of high-confidence calls varied from 0 to 12 across the nine runs. The final curated table retained 3 high-confidence, 9 moderate, 18 watchlist, and 43 low-relevance orthogroups. The most robust direct candidates were sulfatase (OG0017138) and sulfotransferase (OG0020703) families and an FG-GAP/integrin-like surface protein family (OG0018986), whereas common error modes included elevating pentapeptide-repeat orthogroups on motif evidence alone, treating weakly secreted housekeeping enzymes as matrix proteins, and taking low-complexity BLAST labels at face value. 
Skill-enabled agents improved file handling, evidence traceability, and reproducibility of computational checking, but they did not eliminate biological overinterpretation. These results support a best-practice workflow in which LLM agents draft annotations only after deterministic evidence tables are generated, with explicit scoring rules, provenance columns, run-to-run replication, and expert review of high-impact claims.
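The within-method agreement figures above are consistent with a pairwise reading: for each pair of runs, the fraction of orthogroups assigned the identical tier. The sketch below computes that quantity under this assumed definition. The orthogroup IDs and tier assignments are made-up toy values, not data from the study.

```python
from itertools import combinations

def pairwise_exact_agreement(runs):
    """Fraction of shared items with identical tier labels, per pair of runs.

    `runs` is a list of dicts mapping an orthogroup ID to its tier label.
    The pairwise definition is an assumption about how agreement was scored.
    """
    scores = []
    for run_a, run_b in combinations(runs, 2):
        shared = run_a.keys() & run_b.keys()
        matches = sum(run_a[og] == run_b[og] for og in shared)
        scores.append(matches / len(shared))
    return scores

# Three hypothetical runs of one agent configuration over four toy orthogroups.
runs = [
    {"OG1": "high", "OG2": "low", "OG3": "watchlist", "OG4": "moderate"},
    {"OG1": "high", "OG2": "low", "OG3": "low", "OG4": "moderate"},
    {"OG1": "moderate", "OG2": "low", "OG3": "low", "OG4": "high"},
]
scores = pairwise_exact_agreement(runs)
mean_agreement = sum(scores) / len(scores)
```

With three runs there are three pairs, so a reported range and mean per configuration would summarize exactly three such scores.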

16
Concordia: Spatial Domain Detection via Augmented Graphs for Population-Level Spatial Proteomics

Liu, S.; Hsu, L.; Sun, W.

2026-04-22 genomics 10.64898/2026.04.19.719422 medRxiv
Top 1%
4.8%

A key step in analyzing population-level spatial proteomic data is to delineate consistently defined spatial domains across samples. Domain detection is particularly challenging for cancer tissues, which have complex spatial domains with elongated or branching geometries. To address these challenges, we present Concordia, a Graph Neural Network (GNN)-based framework that uses augmented graphs to capture complex spatial domains and is designed to analyze thousands of tissues simultaneously to obtain consistently defined domains. Applied to a lung cancer dataset, Concordia uncovers a spatially defined cancer-associated fibroblast subset linked to clinical outcomes that cannot be identified using protein expression alone.

17
Dissecting the coordinated progression of cell states in spatial transcriptomics with CoPro

Miao, Z.; Qu, Y.; Huang, S.; Laux, L.; Peters, S.; Aristel, A.; Zhang, Z.; Niedernhofer, L. J.; McMahon, A.; Kim, J.; Zhang, N.

2026-04-21 bioinformatics 10.64898/2026.04.17.719309 medRxiv
Top 1%
4.8%

Spatial transcriptomics enables the study of how cells coordinate their molecular states within tissue, providing insight into both normal function and disease processes. A key challenge is to identify gene expression programs that vary continuously across space and are coordinated between cell types. We present CoPro, a computational framework for detecting the spatially coordinated progression of cellular states. CoPro can operate in both supervised and unsupervised modes to identify gene programs that co-vary within or between cell types, and to disentangle multiple overlapping spatial patterns. CoPro can be applied to single-cell-level spatial transcriptomics datasets, including MERFISH, SeqFISH+, Xenium, and histology-imputed transcriptomic data. We demonstrate the utility of CoPro with data collected from colon, brain, liver, and kidney tissues. In the colon, CoPro separates epithelial differentiation along the crypt axis from spatially localized inflammatory signals. In the aging liver, it identifies multiple aging-associated cellular programs superimposed on anatomical zonation. In the brain, the flexible kernel design enables the decoupling of the gene expression gradient along the dorsal-ventral and medial-lateral axes. In the kidney, CoPro identifies tubule-vasculature coordination that is essential in nephron function. These results demonstrate CoPro's utility for analyzing spatial coordination of gene expression in complex tissues and disentangling overlapping biological processes, such as anatomical organization and disease-associated variation.

18
HMCVelo: A Deterministic Model for Hydroxymethylation Velocity in Single Cells

Mishra, P.

2026-04-22 bioinformatics 10.64898/2026.04.20.719607 medRxiv
Top 2%
4.8%

I present hydroxymethylation velocity (HMCVelo), the first velocity framework for DNA methylation dynamics. HMCVelo is a deterministic ordinary differential equation (ODE) model that computes the time derivative of hydroxymethylation state for individual cells and genes. The model exploits a recent advance in single-cell epigenomics, joint single-nucleus hydroxymethylcytosine and methylcytosine sequencing (Joint-snhmC-seq), which enables subtraction-free quantification of 5-hydroxymethylcytosine (5hmC) and 5-methylcytosine (5mC) at single-cell resolution, resolving temporal methylation dynamics from static molecular snapshots. HMCVelo models the methylation-demethylation cycle as three coupled processes (methylation, hydroxymethylation, and demethylation) governed by gene-specific rate parameters estimated at steady state via constrained least-squares regression. Scale invariance reduces the parameter space from three to two free parameters per gene. Applied to murine cortical cells (n = 519 and n = 545), HMCVelo infers cellular trajectories with velocity confidence scores exceeding 0.89 across all cell types, compared to confidence scores below 0.45 when RNA velocity is repurposed on the same data. I further prove that in any closed biochemical system with a conservation law, the complement variable cannot resolve trajectory bifurcations, a result with implications for embedding basis selection in all future velocity frameworks applied to cyclic biochemical systems. This work provides a foundation for multi-omic trajectory inference integrating epigenetic and transcriptomic measurements.

19
Topographical archetypes of somatic mutagenesis in cancer

Lynch, A. W.; Lee, S. S.; Hummel, J. P.; Geiger, B.; Lawrence, M. S.; Jin, H.; Gulhan, D. C.; Park, P. J.

2026-04-21 bioinformatics 10.64898/2026.04.18.719374 medRxiv
Top 2%
4.8%

The genome of every cancer cell carries a record of the mutational processes that have acted throughout its history. Mutational signature analysis, which infers the activity of mutagenic processes from their characteristic base-change patterns, has become an indispensable tool for interpreting somatic mutations. However, this framework captures only which types of mutations a process generates and not where in the genome they occur, a distribution influenced by replication timing, chromatin organization, transcription, DNA secondary structure, and other genomic features. Here, we present a generative probabilistic framework (MuTopia) that jointly infers mutational spectra and their genome-wide topography as nonlinear functions of genomic and epigenomic state. Applied to whole-genome sequencing data from 15 tumor types, MuTopia reveals that mutational processes fall into eight conserved topographic archetypes, or topotypes, shaped primarily by replication timing and chromatin state. Diverse mutational processes converge upon this limited repertoire, indicating that the genomic distribution of mutagenesis is constrained less by the source of damage than by how that damage is processed. Individual mutational processes exhibit state-dependent variation in their genomic distributions: the same signature can adopt distinct topotypes depending on repair proficiency and replication stress. For instance, SBS8 shifts from a canonical late-replicating profile in homologous recombination-proficient tumors to an early-replicating, stress-associated topotype in HR-deficient tumors, and replication stress similarly reshapes the genomic distribution of APOBEC editing. Topotypes, therefore, provide a classification of mutagenesis distinct from spectral signatures, capturing aspects of tumor biology that spectra alone cannot resolve.

20
scVIP: personalized modeling of single-cell transcriptomes for developmental and disease phenotypes

Lai, H.-Y.; Yoo, Y.; Tjaernberg, A.; Travaglini, K. J.; Agrawal, A.; Kana, O.; van Velthoven, C.; Carroll, J. B.; Qiao, Q.; Mukherjee, S.; Fardo, D. W.; Lein, E.; Gabitto, M. I.

2026-04-22 bioinformatics 10.64898/2026.04.20.717759 medRxiv
Top 2%
4.7%

Single-cell RNA sequencing reveals cellular heterogeneity, but linking cellular states to individual-level phenotypes remains challenging. We present scVIP, a generative framework that integrates transcriptional profiles and phenotypic markers to learn personalized individual-level embeddings using generative models and cell-type-aware multi-instance learning. scVIP predicts developmental age, disease progression, and neuropathology, while harmonizing datasets with distinct phenotype definitions. The model highlights disease-relevant cell populations and transcriptional programs underlying neurodegeneration.